Predicting subjective well-being in a high-risk sample of Russian mental health app users

Despite recent achievements in predicting personality traits and some other human psychological features with digital traces, prediction of subjective well-being (SWB) appears to be a relatively new task with few solutions. COVID-19 pandemic has added both a stronger need for rapid SWB screening and new opportunities for it, with online mental health applications gaining popularity and accumulating large and diverse user data. Nevertheless, the few existing works so far have aimed at predicting SWB, and have done so only in terms of Diener’s Satisfaction with Life Scale. None of them analyzes the scale developed by the World Health Organization, known as WHO-5 – a widely accepted tool for screening mental well-being and, specifically, for depression risk detection. Moreover, existing research is limited to English-speaking populations, and tend to use text, network and app usage types of data separately. In the current work, we cover these gaps by predicting both mentioned SWB scales on a sample of Russian mental health app users who represent a population with high risk of mental health problems. In doing so, we employ a unique combination of phone application usage data with private messaging and networking digital traces from VKontakte, the most popular social media platform in Russia. As a result, we predict Diener’s SWB scale with the state-of-the-art quality, introduce the first predictive models for WHO-5, with similar quality, and reach high accuracy in the prediction of clinically meaningful classes of the latter scale. Moreover, our feature analysis sheds light on the interrelated nature of the two studied scales: they are both characterized by negative sentiment expressed in text messages and by phone application usage in the morning hours, confirming some previous findings on subjective well-being manifestations. At the same time, SWB measured by Diener’s scale is reflected mostly in lexical features referring to social and affective interactions, while mental well-being is characterized by objective features that reflect physiological functioning, circadian rhythms and somatic conditions, thus saliently demonstrating the underlying theoretical differences between the two scales.


Introduction
In recent years, evaluation, analysis and improvement of subjective well-being (SWB) has gained a growing attention of both researchers and practitioners [1,2]. Attention to SWB has naturally been coupled with the increasing research interest in depression -the leading cause of disability and subjective well-being loss worldwide [3,4]. The COVID-19 pandemic, resulting in the shift to hybrid work and the decline in face-to-face communication has put many individuals at additional mental health risks [5,6]. Some of the most widely available instruments to mitigate such risks are online and mobile services that offer quick screening tests of subjective well-being and mental health states and automatically generate respective recommendations. More than 240 mental health apps are available in the App Store today, some of which are extensively using machine learning for classifying and scoring their users in terms of their psychological or mental conditions [7][8][9]. Such apps attract consumers concerned with their psychological states, while these concerns are usually associated with higher risks for users' SWB or mental health. As these individuals agree to donate parts of their digital traces, psychological apps become natural hubs accumulating data on individuals at risk. Such data, if available, provide ample opportunities for the development of open source algorithms for early automatic detection of threats to well-being in high-risk populations with their digital traces.
Subjective well-being is most commonly defined in accordance with Diener's approach [10] as a person's satisfaction with their life (which constitutes SWB's cognitive component) and the prevalence of positive emotions over negative ones (affective balance, which constitutes SWB's affective component). To date, about 100 assessment tools measuring about 200 facets of well-being have been proposed, thus complicating the selection of relevant metrics [1]. The two most widely used SWB measurement tools are Diener's Satisfaction with Life Scale (SWLS) [10] and the scale introduced by the World Health Organization in 1998, known as the WHO-5 index [11]. The former aims to capture generalized long-term subjective well-being, while the original goal of the latter was to screen and rate depression. Later, Bech, one of the WHO-5 developers, also showed that this scale is equally good at detecting high degrees of psychological well-being, which he proposed to consider a component of mental health, along with the absence of depression symptoms [12].
Both SWLS and WHO-5 are short unidimensional 5-item scales with proven validity and reliability (α coefficients 0.79-0.89 for the former and 0.82-0.95 for the latter) [13][14][15]. Both have become common for well-being screening in a wide range of populations and among different nationalities [15][16][17][18]. The wide use and the proven quality of these metrics defines their choice for our research in automatic SWB prediction; however, some more details on their distinctive features should be added. SWLS, apart from being centered on pleasure and satisfaction, is also meant to be timeand dimension-independent. The first feature means that it is not tied to a specific time interval and measures satisfaction with our past, present and future. The second feature refers to the generalized character of such satisfaction, not being tied to any particular dimension of human life, such as health, relationships or finance. The choice of the dimensions to be taken into account and the weight assigned to them is left with the subject and is expected to be based on a blend of objective reality and the subject's subjective experience of it. It is assumed that a person is able to adequately assess her well-being and has all the necessary and unbiased information for that [10]. SWLS is widely used by psychologists, public health professionals, and economists. According to the World Happiness Report, SWLS provides a more informative measure for international comparisons of well-being than some measures capturing affective compo-nent only [19]. Importantly, SWLS is stable under unchanging conditions, but is sensitive to changes in life circumstances: thus,its growth is associated with higher likelihood of marriage and childbirth and with lower likelihood of job loss and relocating [20]. It is also predictive of physical and physiological outcomes, as judged from a 4-year follow-up period in the same study. It is these meaningful changes that have been found responsible for the drop of SWLS test-retest reliability from 0.84 in the window of a few weeks to 0.54 in the 4-year window [21]. These changes are clearly distinct from the short-term random mood fluctuations responsible for explaining 16% of variance in the short run. It thus can said that SWLS captures a stable and a transient components both of which are present in human well-being.
In contrast to SWLS, WHO-5 index aims at a brief assessment of emotional well-being over a 14-day period (thus containing no cognitive component and being highly timesensitive). Its items represent positive affect whose absence corresponds to the depression symptoms (negative affect). This is an important advantage of WHO-5 as the subjects are not forced to confess of the presence of any unpleasant and potentially hard-to-admit negative emotions or states. As mentioned above, WHO-5 has been proven effective for the detection of both depression risk [22,23] and the high levels of well-being [12]. Being a short, sensitive, specific and non-invasive tool, it gains over more detailed, but heavier methods for preliminary depression and suicide risk assessment in settings without psychological/psychiatric expertise. WHO-5 has shown high clinimetric validity and the ability to accurately predict a wide range of mental health conditions, including depression; moreover, it has been recommended as an outcome measure balancing the wanted and unwanted effects of treatments [24]. That is why WHO-5 has been adopted in many research fields such as suicidology, geriatrics, youth and alcohol abuse studies, personality disorder research, and occupational psychology [15,24].
Thus, WHO-5 and SWLS, being psychometrically sound screening tools with known outcomes, also measure complementary aspects of subjective well-being. Although measures of emotional affect and reported life satisfaction often correlate, substantial divergences have been found. For instance, almost half of the people who rated themselves as 'completely satisfied' also reported significant symptoms of anxiety and distress [17]. Therefore, quality of life in the current coronavirus crisis is usually measured with both scales [5,6,[25][26][27]: while WHO-5 helps to assess influence of different practices on SWB and the persistence of diminished well-being beyond and during COVID-19, SWLS shows how people feel and how their life perspective changes due to the pandemic. This complementarity indicates the importance of comparative research in prediction of both metrics.
This task is novel for SWB prediction with digital traces: despite the advances in detection of specific mental health problems and the attempts to predict some SWB metrics, no research so far has been dedicated to predicting WHO-5 and its comparison with SWLS in terms of digital behavior traces; moreover, most research is limited to English-speaking populations. Best models predicting SWLS with digital traces from social media, search engine and smartphone activity data demonstrate performance below 0.4 in terms of Pearson correlation -a well-known threshold for correlation between psychological characteristics and objective behavior [28,29] (see also [30,31] for an overview). None of the models combines language, social media and smartphone usage data.
The goal of this study is to predict individual WHO-5 and SWLS levels with a new combination of digital traces in a high-risk Russian-speaking population, to find out which features are the most predictive and what the overall predictive power of our models is. A high-risk population is defined as a population with a higher probability of having problematic levels of SWB, as compared to more general populations. We thus address a completely novel task of comparative prediction of two different aspects of subjective wellbeing, which should have different objective indicators and suggest different actions to be taken by the user. Additionally, we find out that depression risk in Russian-speaking population can be detected by the level of WHO-5 below a certain threshold as successfully as in the populations for which WHO-5 was tested earlier, and this allows us to predict the threshold as well. To do so, we make use of a sample of 372 psychological application users who have explicitly consented to share their private messages, social media data and mobile device usage traces. We use extensive feature engineering combined with regression and classification modeling, the first type of models being aimed at SWB score prediction, and the second -and depression risk identification based on theoretically justified thresholds. We also check our regression models against newest neural network approaches that, however, do not show sufficient quality at the dataset of our size.
The rest of the paper is structured as follows. In the next section we review the existing literature in prediction of SWB and related psychological and mental health phenomena with digital traces. Next, we describe our dataset, our numerous features and the approach to their engineering, as well as the models used. In the Results section we report our best models' performance and the most useful features. In the Discussion section we interpret our results and indicate the most important limitations. We conclude with the perspectives for future research.

Subjective well-being prediction
Prediction of internal psychological and mental states from objective behavior pattern is a highly difficult task [29,32]. Additionally, clinically diagnosed mental disorders (such as depression) and mental disorder risks assessed through threshold scores of screening tests (such as WHO-5) are different categories for prediction. While the former may be partially manifest, the latter, along with psychological traits and conditions, are latent constructs. This means that psychological theory does not expect them to fully correlate with any observable patterns since the former are not thought of as reducible to the latter in principle. This may be one of the reasons why such correlation is seldom high, although this is a subject for further research. As both high SWB and the absence of mental disorder symptoms have been shown to be components of mental health [12,33], prediction of both SWB and mental disorder (or its risk) constitutes two related tasks. However, due to the different nature of SWB and mental disorder as concepts, the former is usually evaluated with continuous predictive models, while the detection of the latter is most often formulated as a classification task.

Detection of mental disorders
A vast amount of studies predict specific mental health conditions with digital traces, mostly with the data from social media, such as Facebook and Twitter. The most widely analyzed conditions of such studies are depression and Post Traumatic Stress Disorder [34][35][36][37][38]. Other conditions include Bipolar Disorder, Anxiety and Social Anxiety Disorder, eating disorders, self-harm and suicide attempt [39][40][41][42]. Linguistic features used typically include word n-grams, sentiment, specific lexica (e.g., Linguistic Inquiry & Word Count dictionary, LIWC) and topic modelling, with other features related to social networks, emotions, cognitive styles, user activity and demographics [34][35][36][37][38][39]42]. Model evaluation metrics include Area Under the Curve (AUC), Precision, Accuracy of classification, and Correlation for continuous measurements. The results for binary mental health problem identification are high, reaching an AUC of 0.7-0.89, Precision up to 0.85, and Accuracy of 0.69-0.72 [30].
Ground truth information in such studies is obtained from different sources, leading to different quality. Most studies use either self-reported survey data [34,37] or self-declared mental illness [36,39]. The latter is prone to errors and bias induced by specific data collection methods.
In a recent study Eichstaedt et al. [38] effectively predict depression of Facebook users against medical records information. The authors use a 6-month history of Facebook statuses posted by 683 hospital patients, of whom 114 were diagnosed with depression (rate similar to the general population), and classify depression VS other medical diagnoses with an AUC = 0.72. Features of Facebook statuses include words and word bigrams, temporal characteristics of posting activity, metainformation on post length and frequency, topics and dictionary categories, with interpersonal, emotional and cognitive categories being among the best predictors.
The effects of smartphone usage on mental disorders, until very recently, have been mostly studied with self-reported data (see [43,44] for an overview). Meanwhile, smartphone apps that collect usage data provide an unprecedented opportunity to access objective and precise information on smartphone application usage. Hung et al. [45] find that phone call duration and rhythm patterns are predictive of negative emotions, while Saeb et al. [46] predict depressive symptom severity with geographical location and phone usage frequency information. However, as feature engineering with phone app usage data requires considerable time and effort [47], the potential of such data of psychological research is yet to be discovered.

Prediction of SWB levels
There have been a few studies aimed at predicting subjective well-being levels, mostly with regression, which obtain modest results. Individual and relational well-being was predicted from social network data [28,48] and from objective smartphone use data [49]. The reported results are close to the upper bound expected in this task: the meta-analytic correlation between digital traces and psychological well-being has been estimated as r = 0.37 across nine studies, including prediction of subjective well-being, emotional distress and depression [28]. The only study that reaches a higher correlation of 0.66 in one of the models [49] does not specify the scales used for measuring SWB; however, interestingly, it finds that while some apps predictably have a negative effect on well-being, others affect it positively.
Diener's SWLS, to our knowledge, has been predicted in only four studies that use digital traces in a cross-validated setting. In his pioneering study, Kosinski et al. [50] predicted SWLS with linear regression for 2340 Facebook users based on 58K 'Likes' -preferences of webpages indicated by the users. The Likes data dimensionality was reduced to top 100 components in a SVD model based on a larger dataset (58K users). The obtained correlation reached r = 0.17, whereas empirical test-retest correlation for SWLS was r = 0.44.
Collins et al. [51] predicted SWLS with Random Forest Regression and various Facebook features, including demographics, networking data, photos, likes, ground truth Big Five traits of the users, of their significant others and friends, and predicted Big Five as a proxy. The best result for a sample of 1360 users with Big Five features as a proxy reached the Mean Absolute Error (MAE) = 0.162, whereas the model with social network features produced MAE = 0.173 for SWLS. Unfortunately, no other evaluation metrics were reported in this study. Schwartz et al. [52] applied Ridge Regression to predict SWLS of 2198 individuals using their Facebook statuses. Thousands of linguistic features were extracted from the status texts, including 2000 topics obtained with the Latent Dirichlet Allocation topic modeling algorithm, word uni-and bi-grams, LIWC and sentiment lexica. A message-user level cascaded aggregation model was additionally trained on a disjoint dataset, which allowed to improve regression results from Pearson r = 0.301 to r = 0.333. Facebook status data were also used by Chen et al. [53]  There is a certain number of studies predicting SWB with app usage data. Some of them rely on self-reported measures of app use [54], while others collect objective data [49,55]. Correlation in David's model range from 0.31 to 0.66, however, the research does not specify the scales used for measuring SWB. At the same time, interestingly, it finds that while some apps predictably have a negative effect on well-being, others affect it positively. Gao and colleagues [55] report correlation from 0.34 for male users to 0.66 for female users in the task of predicting SWLS, however, they do not report the full feature set and the contribution of each feature in their best models. Instead, they mention that the most predictive variables are communication apps, certain types of games and the frequency of photo taking. None of these studies mentions cross-validation.
Overall, although the results of subjective well-being prediction are promising, several gaps in the existing research can be identified. First, WHO-5, which is an effective screening tool for depression risk and subjective well-being, has never been studied in a predictive research design. Second, all the studies predicting SWLS are limited to Englishspeaking populations and respective linguistic features. Moreover, these works only address Facebook digital traces, including profile, texts and likes. Finally, only scarce feature interpretation is reported in the previous studies, and digital trace manifestations of different well-being dimensions have never been compared.

Our approach
In this study, we set out to predict two different concepts of subjective well-being: one combining affective balance and life satisfaction (measured by SWLS index and further referred to as satisfaction-related SWB) and the other conceptualized as a reflection of mental health (measured by WHO-5 index and further referred to as mental SWB). For predicting well-being values, our task is defined as regression, while for detecting depression risk, we formulate our goal as a binary and trinary classification task. For the latter, we identify the threshold values of WHO-5 by validating them against the scores of the same users on the scales of depression, anxiety and stress, so that the WHO-5 values predicting these scores with the highest sensitivity and specificity are chosen. We perform our prediction of SWB on the texts of private messages, social media and smartphone usage information and perform regression and classification experiments in a cross-validated Machine Learning design. The novelty of the current study lies in the following: 1. We present the first study so far on predicting subjective well-being measured by WHO-5; 2. We find out a close association of WHO-5 thresholds with three scales of mental health which is promising in terms of extending our approach to the task of simultaneous prediction of a range of various mental health risks. 3. We are the first to compare satisfaction-based and mental SWB, analyzing their intersections and differences in terms of predictive features; 4. This is the first study to combine language, social media and phone app usage features in well-being research; 5. To our knowledge, our study is the first to address subjective well-being prediction in a Russian-speaking population and respective data: the Russian social network VKontakte and texts in the Russian language; 6. We use a dataset of a psychological application users, allowing us to predict subjective well-being in real-world conditions for a sample with high mental risks, which has never been done before;

Dataset
Our dataset was collected in collaboration with Humanteq social analytics company, using its DigitalFreud app (DF) -a Russian-language phone application for psychological selfassessment -promoted among Android-based smartphone users through Google Ads. Android was chosen as the basic operational system for data collection, as at the time of the app development and promotion its users constituted the majority (68-76%) [56] of Russian smartphone users who in turn were the app's target audience and who constituted 57-64% [57] of Russia's population. Additionally, the app was available to Russian speakers from any country, and although users from the countries other than Russia constituted the minority, none of the samples we further analyze is intended to be representative of Russia. Data collection via a psychological app of such type was used to access a high-risk population (its high-risk status was confirmed in subsequent comparison of its mean SWB to those in other populations, presented further below). Users were offered to take as many free tests as they wanted (including personality traits, depression, anxiety, stress, cognitive, motivation and SWB tests) and to explicitly consent to the access to their VKontakte profile data and/or smartphone use data. Based on the test results, users were offered psychological feedback and analytics on the use of VKontakte and/or their smartphones. On average, DigitalFreud users chose to fill in 1.5 questionnaires and shared varying subsets of their data, which made the overall dataset quite sparse.
Privacy policy included a clause stating that the data could be used for research. The study was approved by the HSE Ethics Committee; nevertheless, the data were anonymized prior to the analysis. No personal information (i.e. allowing to identify the users) was included in the sample. In particular, all the user profile ids were encrypted.
The initial sample included 2050 accounts of DigitalFreud users who have completed at least one of the two questionnaires of our interest: SWLS [10] or WHO-5 [58]. The vast majority completed either of the tests only once; for those who did it more than once, the earliest score was taken into our dataset.
The following digital traces data were available for the participants: • DigitalFreud profile data; • VKontakte user data; • Phone application data. Due to data sparsity, our final sample used in prediction contains digital traces by 372 users. The procedure of data cleaning that produced this dataset is given in Appendix 1. Thus the dataset is small because the data on well-being combined with personal digital traces is highly difficult to obtain, as it requires both considerable effort from a user on completing the questionnaires, and trust allowing them to share sensitive digital traces. However, our dataset is uniquely tailored to the task of predicting SWB in a high-risk population of mental health app users.
Additionally, there is a heldout dataset, which consists of messages written by 572 users, who lack other important features for prediction (demographics, phone app usage) but have text data. The heldout dataset is used for preliminary feature selection (see sections Words, Word clusters below). Before feature selection, texts were tokenized with happiestfuntokenizing 1 and lemmatized it with pymorphy [59].
The phone app dataset consists of phone application usage data by 992 users who lack other important features for prediction. The phone app dataset was used for preliminary phone application categorization and feature engineering.
We also collected a sub-sample of users (N = 417), who have completed the WHO-5 and at least one of the following questionnaires evaluating different mental health risks (mental health dataset): 1.  [62,63].
The mental health dataset was used in the WHO-5 classification task to select cutoff thresholds of the classes to be predicted, so the former would be representative of a range of mental health conditions.

Self-reported well-being measures
Satisfaction-related well-being scale (SWLS) The SWLS questionnaire was translated to Russian and validated by Ledovaya et al. [64].
The questionnaire contains 5 statements, each characterized by 7-point Likert scale ranging from 1 (strongly agree) to 7 (strongly disagree). The resulting SWLS score ranges from 5 (low satisfaction) to 35 (high satisfaction). The scale has good internal consistency: α coefficients ranging from 0.79 to 0.89. Test-retest coefficient, as already mentioned, ranges from 0.54 to 0.84 depending on the time lag between measurements (years or weeks, respectively) [21] and amounts to 0.78 in the Russian language version [64]. In our sample, 1727 accounts have information about the SWLS score.

Mental well-being scale (WHO-5)
We use the official Russian-language version of WHO-5 scale developed by WHO itself [58]. Each of WHO-5 items is scored on a 6point Likert scale ranging from 0 (at no time) to 5 (all of the time). The WHO-5 score ranges from 5 (absence of well-being) to 30 (maximal well-being).The scale has good Internal consistency: α coefficients ranging from 0.82 to 0.95 [13]. Test-retest coefficients are available for specific populations only and only in the short run ranging from 0.81 to 0.83 [65,66]. In our sample, 1791 accounts have information about the WHO-5 score.
Mental well-being classes As mentioned earlier, WHO-5, unlike SWLS, is indicative of a range of mental health conditions [24] and was directly designed to detect one of them [11]. Decisions of mental health, be it screening test results or medical diagnoses, are usually binary and point either at the absence or the presence of a disease. For such tasks scales need to be transformed into sets of discrete classes based on a certain threshold values. Such validated values exist for the original English-language WHO-5 scale (0.28 for major depression and 0.5 for depression). They are recommended for all nations and languages, but in fact have never been tested for the Russian-language population. Meanwhile, it has been shown that cultural differences matter in scale construction [67] and that, specifically, they complicate both mean WHO-5 comparison and threshold comparison across countries [15]. Therefore, we validated several thresholds ourselves. For this, we analyzed the mental health dataset of 417 DigitalFreud users who have completed both WHO-5 and one of the three questionnaires -on depression, anxiety and stress -and found the values of WHO-5 index best predictive of the classes of these three scales. This approach was our choice for two reasons: • the data on clinically diagnosed depression are absent from our dataset; • the three mentioned scales were validated for the Russian language and thus have been used here as the best available benchmarks. We tried out different WHO-5 thresholds to reach better sensitivity and specificity in representing the following conditions: PHQ/GAD ≥ 10 for depression and anxiety [68], and PSS ≥ 21 for stress [63]. Additionally, as from our earlier work [69] we know that classes derived from scale reduction might be better predicted in a trinary design in social science NLP tasks, we also experimented with three-class divisions.
Eventually, our analysis resulted in the following cutoff values of the normalized WHO-5 scale: • Binary cutoff = 0.51 with classes containing 221 and 151 users in the low and high SWB classes, respectively; • Trinary cutoffs = [0.35; 0.59] with classes containing 111, 158 and 103 users in the low, medium and high SWB classes. Table 1 illustrates sample statistics for each of the mental health conditions, and specificity and sensitivity in terms of the selected WHO-5 cutoff values. In our high-risk sample of mental health app users, the binary WHO-5 cutoff value 0.51 allows to reach high sensitivity across the analyzed mental health conditions, while preserving moderate specificity. The trinary cutoff values 0.35 and 0.59 allow to obtain low and high mental well-being classes with very high specificity.

Digital traces
DigitalFreud profile Account information about the DigitalFreud user includes encrypted DigitalFreud and VKontakte user ids, SWLS and WHO-5 scores, gender, birth year, education, employment and marital status, and date and time of the DigitalFreud app installation.
VKontakte user information Humanteq chooses to match DigitalFreud data with VKontakte data since the latter is the most popular social networking site in Russia. We use the following data obtained with VKontakte application programming interface (API): 1. User Profile data. Although VKontakte API provides access to potentially rich user information, in practice users seldom fill in their profiles, and the data is sparse. As a result, we only use gender, birthdate, and the number of friends and subscriptions in our analysis. 2. Wall posts (text, date and time, information on reposting with the original post contents and encrypted user id, number of reposts, comments and likes) available for 1871 users. 3. Directed private messages (text, date and time, encrypted author and addressee ids) available for 1044 users.
Phone application usage Phone application usage was monitored for one week following the initial consent obtained from the user when she started using DigitalFreud, which was consistent both with the app's terms of use and the policies of the Android platform. The collected information includes name and package of the application, start time and duration of the application usage in foreground in milliseconds. It is available for 992 users. In a few cases when the users quit the phone app data sharing before the end of the week, the recorded period was shorter.

Descriptive statistics
The main parameters of the descriptive statistics for our final dataset of 372 users are given in Tables 2 and 3. Our dataset is predictably skewed towards containing more females (80%) and young people (mean age 23 ± 5 y.o.) against 53% of females and the mean age of 39 y.o. in the general Russian population [70]. However, as it has been mentioned, this sample is not theoretically intended to represent Russia. Consistent with Collins et al [51], we normalize both well-being scores to the ranges between [0, 1]; to do so, we subtract 5 from both scores, then multiply SWLS values by 1/30, and WHO-5 values by 1/25.  Additionally, the distribution of the SWB and demographic data in the final dataset is illustrated in Appendix 2, Figs. 1-4. SWLS and WHO-5 intercorrelate strongly with r = 0.568, p < 10 -32. The level of internal consistency of both scales is high (Cronbach's α > 0.82).
Both SWB scores in our final sample are consistently lower than in other studies made on other groups of Russians. Thus, WHO-5 score amounts to the average of 0.46 ± 0.187 in our dataset against 0.60 ± 0.191 obtained in a study of Russian Facebook users [71], the only available evaluation of WHO-5 for Russia. Likewise, while the mean SWLS score among our participants is 18.3, a study on a sample close to the general Russian population (mean age 41 y.o. with 54% of women) shows the score of 23.6 [72]. A younger group of Russian students (mean age 20 with 65% of women) which is more similar to our sample scores even higher: 24.4 [73]. The lower SWB levels in our dataset are explained by selfselection of specific individuals to the DigitalFreud app: it naturally attracts users interested in seeking psychological and mental health information and advice, i.e., potentially more likely to have problematic mental health conditions. This is in line with our research goal of studying high-risk populations, of which our sample is an obvious example exactly due to the lower SWB scores.

Feature engineering
For our task of SWLS and WHO-5 prediction, we construct features of three main types: • User metadata and overall activity: demographics, DigitalFreud & VKontakte profile statistics, and overall phone app usage statistics; • Textual, or linguistic features: -Words; -Sentiment scores; -RuLIWC; -Word clusters; • Phone app usage statistics by app category. Overall, we constructed 660 features for SWLS and 651 for WHO-5. Most features were calculated as counts, ratios or counts by time period directly from the final dataset. However, words and word clusters as features were trained on the heldout dataset that does not intersect with the final dataset. Of these features, only those that correlated with the target variables were selected for the main experiments. In the main experiments, the features were submitted to the regression or classification models, which performed on the final dataset that was divided into train, development and test subsets in a 10-fold crossvalidation scenario. In this scenario, (1) multiple models were trained on the train set, (2) recursive feature elimination was performed on the development set based on MAE of the models, and (3) final scores for each feature type and each model were computed  based on the test set. More details on the main experiment procedure are given in the Machine Learning Experiments section.

User metadata and overall activity features
There are 40 features describing demographics, overall phone application usage data and the data on the overall activity patterns based on DigitalFreud and VKontakte profiles (see Table 4). The activity-related data include three groups of features: (1) numbers and volumes of personal messages written during one month preceding test completion, (2) numbers of alters, or accounts that a user has a message history with, for every user in each of the 12 months preceding test completion, and (3) weighted differences between the last two months in terms of the message volume and the number of alters. In building phone app usage features, we follow the previous research [74,75] which identified three-and six-hour periods of online activity to be significant markers of mental illness. In our research, we break phone app usage into three-hour periods of activity. Some features have been excluded from the analysis, due to data saprsity.

Linguistic features
Our extensive analysis of user texts has shown that VKontakte public wall posts are too sparse and include mostly web link content, which does not allow for effective prediction.
As a result, we construct all the linguistic features based on private messages written by the users in VKontakte messenger, mostly during one year preceding the installation of DigitalFreud app.

Sentiment scores
We use six features representing the proportions of positive and of negative words in the messages created during one month or one year preceding test participation, or in the entire messaging history of a user. Each feature represents the proportion, or l1-normalized frequency, of positive or negative sentiment words written in one of the three time periods (which results in 2 × 3 = 6 features). The sentiment words were identified with a closed-vocabulary approach based on the Russian sentiment lexicon RuSen-tiLex [76].

Words
We adopt the open-vocabulary approach to word features predictive of wellbeing [77]. Given the small size of our final dataset (372 observations), using all the frequent words as features (12K words with frequency ≥ 200) would inevitably result in overfitting. To overcome this and to select a reasonable number of interpretable features, we use the heldout dataset as follows: • First, a sub-sample of users who have filled both well-being questionnaires was selected from the heldout dataset (396 users); • Next, we selected 12.5K words occurring more than 200 times in the joint one-year long message collection of all users and calculated their TfIDF scores using 396 individual message collections as 396 texts for such calculation; • We filtered out words with p > 0.01 in the ANOVA tests relating these words to SWLS and WHO-5 values in the heldout dataset, which has resulted in the selection of 165 words for SWLS and 224 words for WHO-5 (see Appendix 3 for the full list). Words belonging to either of these sets (353 words) are used as features for prediction.  The main parameters of the resulting feature sets are described in Table 5.

Phone app categories and usage features
The phone app categories and usage features are based on the 1-week phone app usage history shared by the participants. App categories, or types were obtained from the phone app dataset data by using 53 app categories generated automatically from 28K app descriptions and by manually uniting them into larger groups as described in [47,49]. As a result, we identified the following nine app categories: Game, Education+Productivity, Tools, Entertainment, Personalization, Health+Medical, Social+Communication+Dating, Photography, covering 21.5K apps, with the rest 6.5K apps having been assigned to Other. The main app usage features were calculated as the total time devoted to a certain app category (e.g. Game, Photography or Other) in each of eight three-hour time slots of a day, averaged over all days of a given user (9 * 8 = 72 features), as well as overall time spent for this category in the entire app usage history of an individual (9 features). Next, we constructed several normalized versions of each feature. Namely, we normalized them by the total app usage time in this category, and by the total app usage logged in the current threehour period. This resulted in 9 + 72 * 3 = 225 features. The phone app category features are exemplified in Table 6.

Machine learning experiments
We performed specific experiments for each of our two subtasks: prediction of satisfactionrelated and mental well-being scales and prediction of the classes in the latter. As we aimed at interpretable results, our main experiments were based on classical regressions. Simultaneously, to make sure that we obtain the best possible prediction quality with the available contemporary methods, we also carried out extensive experiments employing deep learning approaches (described in Appendix 4). However, they yielded inferior results. The two main possible reasons for that are the following (1) our data are hard to obtain, and the obtained data are sparse and loosely intersect between users, which reduces the sample significantly; (2) our message data is hierarchically organized, with numerous alters with whom every participant communicates and numerous messages sent to every alter, while additionally the number of alters and messages highly varies between the participants/alters (see Table 3 above).
Our experiment on prediction of SWLS and WHO-5 scales was performed using a 10-fold cross-validation design with train, development and test sets (298/37/37 users, 80/10/10%). The non-overlapping train, development and test sets were constructed as follows: 1. The sample was shuffled and sorted by the well-being values; 2. The sorted sample was divided into 10 bins containing 37 users each so that bin i consisted of users with index = i + K * 37, where K varied in the range [0; 36]. Thus every bin was equally distributed in terms of the SWB values. 3. For ith cross-validation fold, bin i was used as the test set, bin i+1 -as the dev set, and the remaining users belonged to the training set. Our evaluation metrics for regression include Mean Absolute Error (MAE), Pearson r and R2-score. Hyperparameter values were chosen inside the cross-validation loop based on the results obtained from development by MAE values. Recursive Feature Elimination (RFE) was performed based on the development set to identify the informative features in each cross-validation fold. RFE was adopted based on the earlier experiments which had shown the increase in model performance with RFE. Additionally, RFE allows to select a small number of informative features, improving the model interpretability. The selected best hyperparameters and features were used to evaluate the quality of prediction on the test set inside the cross-validation loop. In the end, the evaluation metrics were averaged across all 10 folds.
Predictions of SWLS and WHO-5 scores were performed with seven regression models, including Linear Regression with various regularization techniques, Decision Tree, and two ensemble methods (see Appendix 5). WHO-5 classification was performed with three classification models based on our preliminary experiments (Appendix 6).
Classification of individual WHO-5 levels was performed in a binary mode with two classes (low VS high well-being) and in a trinary mode with three classes (low VS medium VS extremely high). The models and hyperparameter values are described in Appendix 6.
We report F1-macro and F1-weighted metrics over all the classes, as well as F1 metric for the lowest and the highest classes separately. We additionally report True Positive and False Positive Rates for the low well-being class, as these measures are typically used for screening test of various mental health conditions (cf. [38]).
All the calculations were performed in python with pandas, scipy, and scikit-learn libraries.

Prediction of well-being scale values
The continuous modeling results for the SWLS and WHO-5 well-being values are presented in Tables 7 and 8 Overall, the best feature set is words written by the users in messages, and the best model is ElasticNet.

Prediction of WHO-5 classes
The main classification results for the WHO-5 well-being are presented in Table 9. The full WHO-5 classification results are presented in Appendix 9.

Significant features
The features in the best performing continuous models of satisfaction-related well-being (SWLS) and mental well-being (WHO-5) scales are illustrated in Tables 10 and 11. Only the features which were selected by RFE in at least five out of ten cross-validation folders are included; the features significant in both SWLS and WHO-5 regression are highlighted in bold. All the significant features are listed in Appendices 10, 11.

Discussion
In this paper, we have introduced a novel task of predicting mental well-being measured by WHO-5 index, as compared to traditionally studied satisfaction-related SWLS, with digital traces, and performed it in both continuous modeling and classification designs.
In the latter, we have shown that the selected WHO-5 thresholds are representative of a range of three mental well-being-related conditions (depression, anxiety and stress) with high sensitivity and specificity. Furthermore, the results obtained in mental well-being classification are highly promising (0.792 True Positive Rate and 0.404 False Positive Rate) in the binary task with our highly sensitive threshold. This threshold is very close to the one recommended by WHO for moderate depression screening (0.51 against 0.50). The classification result itself is similar to the performance of the best existing models that predict other mental conditions with digital traces [30,38]. Likewise, our results of SWLS and WHO-5 scale prediction, with Pearson r = 0.402 and 0.367, respectively, improve the state-of-the-art metrics reported previously in similar tasks with cross-validation designs [51,53]. Since, as mentioned earlier, prediction of internal states with observable behaviors has its limitations [29,30], the obtained correlation may be considered high. As a result, we obtain a model which is highly sensitive and sufficiently specific for identifying low levels of subjective well-being requiring intervention in a high-risk population of mental health application users. Our model is unique not only in its accurate prediction of WHO-5 classes that have a proven ability of depression risk detection, but also in its potential to develop into a tool for broader screening for mental health risks, not limited to specific conditions reported in previous studies (see [28,30,48] for an overview). We have performed a unique comparison of regression models predicting both SWLS and WHO-5 indices on the same sample. Our best models for both indices show similar performance in terms of correlation and R2 metrics, but WHO-5 is predicted better in terms of MAE across all feature combinations; however, this is likely an outcome of different distributions of SWLS and WHO-5 in our sample (see Fig. 1, 2, Table 1

above).
Our design also allows us to compare the features predictive of life satisfaction-related SWB and mental SWB. Although our experiments have revealed only two highly predictive features that are common for both SWLS and WHO-5, they are highly interpretable in terms of psychological theory. These two metrics are (1) phone app usage time between 9 and 12 AM normalized by total app usage time, and (2) negative sentiment expressed in private messages in the last month, which have positive and negative coefficients, respectively, in both SWLS and WHO-5 tasks. Both of these findings confirm previous results obtained in various populations: participants affected by depression and other low SWB   conditions have been found less likely than average individuals to participate in online activities in the morning hours around 9-10 AM [74,75], while their circadian rhythms have been often disrupted [7]. Such disruption is what usually accompanies insomnia or hypersomnia, a symptom of the major depressive disorder listed in DSM-5 [83], the Diagnostic and Statistical Manual of Mental Disorders developed by the American Psychological Association. Negative sentiment has been shown to correlate negatively with life satisfaction [34,53,84] and subjective well-being [71]. Negative sentiment in written or oral speech may also sometimes, although not always, be a manifestation of depressed mood, another symptom of depressive disorder according to DMS-5.
Thus, these two highly predictive features intersecting in both SWLS and WHO-5 prediction models can indicate different degrees of SWB: from simple dissatisfaction with life, circumstances or personal achievements (relevant for SWLS), to a deterioration in mental or physical condition and serious symptoms of the depressive spectrum (relevant for WHO-5). They can be recommended for use across various SWB-prediction tasks.
Predictors unique for satisfaction-related well-being are much more dominated by verbal features related to affect-laden psychological and social content. They are often obscene lexemes, but also represent both negative and positive sentiment polarities (quit_VERB, spend_ VERB, fine_UNKN, explanation_NOUN, bully_VERB, spoil_VERB, gasp_ VERB, nice_COMPARATIVE). Association of positive lexica with SWB is consistent with Weismayer [85], who also finds negative relation of SWB with lexica expressing  ). Also, these lexica fit well with some of the ontologies developed for depression detection [45]. Prevalence of lexical features among SWLS predictors suggests that this index, indeed, captures subjective perception of well-being rather than symptoms of mental disorders, such as depression.
On the contrary, in mental well-being level prediction, phone app usage features take a clear lead, especially those related to the ratio of nighttime app usage (3-6 AM). Additionally, lexica related to biological processes are also a distinctive marker of low WHO-5 levels. All this aligns well with the primary goal of WHO-5 to reveal depression and its proved ability to differentiate between problematic mental health states and high levels of mental health-related well-being. Specifically, app usage rhythms and biological lexica are likely to be manifestations of such depression symptoms as increase or decrease in either weight or appetite, insomnia or hypersomnia, and fatigue or loss of energy [86]. At the same time, they can be markers of a poor physical condition, which is also detected by WHO-5 [18].
Finally, the significance of negative sentiment in the long periods of messaging (1 year and longer) for WHO-5 levels suggests that mental SWB measured by this index might in fact have a more stable behavioral pattern than SWLS. However, there is also a possibility that the stable component of SWLS is underrepresented in our features or subjects. Simultaneously, it may be that not only SWLS (as shown in [21]), but also WHO-5 contains both stable and transient components that may be explained by different factors. While the temporal stability of SWB may be expected to be related to constant individual features, such as presence of a chronic disease, SWB volatility, on the contrary, should be explained by short-term mood fluctuations and long-term meaningful changes in life, such as those listed in the introduction. Individual predictors of SWB stability and volatility may differ for SWLS and WHO-5, and it may happen that in our sample the feature set is skewed in favor of WHO-5 stability factors. In any case, our analysis of the overlapping and the differing predictors for WHO-5 and SWLS shows that satisfaction-related SWB and mental SWB share some of their transient factors rather than stable ones. These preliminary observations of the temporal dimension of SWB set a promising direction for future research.

Conclusions
The growing interest in tracking human mental states and in the development of mindfulness leads to the growth of applications that screen or even diagnose mental conditions and offer solutions for their correction, including those based on objective data. Our research has shown that it is possible to create machine learning models based on interpretable traits and predict various aspects of subjective well-being at the state-of-the-art level.
In doing so, we have performed the first study on predicting subjective well-being measured by  We have demonstrated that certain WHO-5 level thresholds are indicative of a range of mental health conditions prevalent in a sample characterized by high risk of mental health problems. We have obtained promising results in classification of mental SWB into classes constructed based on these thresholds. This approach has allowed us to identify individuals affected by low subjective well-being with high recall and reasonable false positive rates, based on their digital traces.
Our study is also the first to compare prediction performance and predictive features of mental SWB and satisfaction-related SWB. We show that several predictors are shared by well-being measured by both WHO-5 and SWLS, and these digital traces are bluntly indicative of overall (un)well-being. At the same time, digital traces distinguishing between WHO-5 and SWLS are closely related to the conceptual difference between these two indices: while SWLS is characterized by expressions denoting affect-laden psychological and social content, WHO-5 levels are manifested in objective features reflecting physiological functioning and somatic conditions, i.e., lexica related to biological processes and circadian rhythm-related ratios of phone app usage.
To our knowledge, this is the first approach to subjective well-being prediction in a Russian-speaking population, and the first to combine language, social network and phone app usage features in well-being research. By leveraging phone app usage logs, profile and message data from the Russian social network VKontakte, we have been able to improve prediction of satisfaction-related SWB (SWLS) and propose a first predictive model for mental SWB (WHO-5). At the same time, as our sample has been very small and limited to a high-risk population, the study needs replication on larger samples representative of wider social and psychological groups. The major obstacle to this is that VKontakte private message data are no longer available for any type of download, while other social media are even more restrictive. Development of public policies and regulations encouraging private data-collecting companies to share portions of their data for public good purposes is highly recommended.

Appendix 1: Data preprocessing
The dataset was cleaned: • First, birth year and gender were identified from DF and VK profile data. We removed data where age or gender were not available, or contradicted between DF and VK; • We only selected users having non-zero information on phone app usage, number of VK friends and at least 100 characters written in messages in the month immediately prior to DF installation. This left us with user 446 accounts; • We removed duplicates of VK and DF id from the data, giving priority to the data profiles which included more information filled in, and to profiles which were characterized by a later DF install time.

Demographic data description
Total sparse data sample includes information about 1960 users. 17.6% of them (344) do not provide information about the country. The rest of the sample (1616 users) contains 84.4% of Russian users, 5% -Belarus, 2.4% -Ukraine, 1.9% -Kazakhstan, 1.1% -USA. We also have users from the countries below, but their frequencies are not higher than 1%: Japan, South Korea, Moldova, Germany, Great Britain, Canada, Finland, Kyrgyzstan, Italy, Argentina, Norway, Israel, Cyprus, China, Vatican, Honduras, India, Serbia, Latvia, Liechtenstein, Iceland, Uzbekistan, Hungary, Georgia, Denmark, France, Ivory Coast, Cook Islands, Estonia, Australia, Romania, Netherlands, American Samoa, Albania, Gambia. Information about the city of the current sample is absent in 30% of cases (578), but the distribution of ten most common cities for the rest of the users (1382) is described in Table 12.
The final dataset contains 372 users: 15% of them (57) do not provide information about the country. The rest of the sample (315 users) contains 92% of Russian users, 2.5% -Belarus, and 1.3% -Ukraine. There are also users from the countries below (each has less than 1%): Kazakhstan, USA, Latvia, China, Finland, Norway, Japan, Hungary, South Korea, Canada. Information about the city of the current sample is absent in 26% of cases (98), but the distribution of ten most common cities for the rest of the users (274) is described in Table 13.

Appendix 4: Preliminary deep learning experiments 4.1 RuBERT
First, we performed experiments with RuBERT models [Kuratov & Arkhipov 2019] based on post and message data. The difficulty in applying BERT-like models in our textual data lies in the fact that BERT model input is limited with max. 512 sub-tokens; at the same time, posts and messages in Vkontakte can be much longer and don't have a small character limit (as it is, for example, in Twitter). This results in 2 issues, which have to be solved to apply RuBERT to our data: • input sequences should be truncated to 512 sub-tokens maximum; • input sequences by the same user should be aggregated. Solving these issues is not a trivial task for VKontakte posts and messages for the following reasons: • Posts and messages have different length, they can be much longer than 512 sub-tokens; • The numbers of posts, messages and message alters for every user vary a lot; • The rhythm of posting/messaging varies a lot for every user: while active during one month, a user can have no posts or messages written in the previous 6 months; Post/message information aggregation involves pooling of the individual RuBERT model results, which means basically averaging information between the range of posts/messages by a user, whereas a lot of information is lost. Due to these reasons, we performed most of our RuBERT-based experiments with posts, which, due to their smaller numbers, are easier to aggregate in the RuBERT models. We used data by 902 users with at least 10 posts. We fed each post into one of the RuBERT models [Kuratov & Arkhipov 2019] after truncation. After the RuBERT model, we used a variety of additional layers. Regression was always performed by the final Dense layer. The experiment hyperparameters included the following: • Using RuBERT as an embedding layer or fine-tuning it for the regression task; • The models included: RuBERT, Conversational RuBERT, Sentence RuBERT; • We included all users (902), and those having at least 50 messages (222); • We used the train/dev/test 5-fold cross validation; • We included up to 64 posts by each user truncated to 128 sub-tokens each; • We also aggregated the latest posts by each user and truncated the result to 512 sub-tokens; • We used the full RuBERT output or the last 'class' token; The layers after the RuBERT models were:

Sentiment analysis with RuSentiment BERT
As it was mentioned before, Chen et al. [2016] used sentiment analysis to predict SWLS; we also performed experiments with user messages to assess sentiment. The idea is that distribution over sentiment classes can be used as features for predicting subjective wellbeing levels. Their many different approaches to classifying messages by sentiment. One of them is to use word dictionaries with sentiment marks. However, it has two important disadvantages: the sentiment of a word can be changed by the context of its use; it is not clear which label should we assign to messages with many words of different sentiment (especially if they are distributed evenly inside the message). These disadvantages lead us to use another common approach for sentiment classification. We used a pre-trained neural network. We found an open-source model with BERT architecture [Devlin J. et al., 2018] which was trained to define the sentiment of VKontakte posts. To be more precise, this model is a result of fine-tuning multilingual BERT with linear head on top using the RuSentiment dataset [Rogers A. et al., 2018] on five classes ("neutral", "negative", "positive", "speech act", "skip") classification task. Using a held-out dataset we subsample users who provide access to their messages. We created a dataset with around 400 users containing messages which were written by them in the last three months before they achieved a WHO score. By providing an exploratory data analysis we found that 10 per cent of users have less than 30 messages, so we cut off these samples. The resulting dataset has 354 samples where each user on average has 4,719 messages (median: 2,415). We normalize the frequency of each sentiment class using the overall number of messages corresponding to a user.
First, we check the correlation between the sentiment classes and WHO score. Table 15 shows that there is no strong correlation.
We also construct a pipeline with a regression model on top of this frequency distribution with different feature combinations, but the models do not show promising results (Table 16).
We assume that achieved results can be explained in the following way. The domain of VKontakte message texts can be different from the domain of VKontakte post text. First, because posts can be interpreted as a complete (finite) phrase, but not a message, which should be interpreted inside the dialogue context. A separated message can have not enough information to classify its sentiment. The absence of dialogue boundaries (when a user starts one dialogue session and finishes it inside a long thread) does not allow us to reconstruct context for a message, which possibly can help to gain a more accurate sentiment classification.

Appendix 5: Models and hyperparameters used for SWLS and WHO-5 regression
See Table 17.

Appendix 6: Models and hyperparameters used for WHO-5 classification
See Table 18.